The Logistics Supply Chain ETL with 3rd-Party APIs project is a robust data pipeline that integrates shipment, tracking, and delivery data from carriers such as ShipStation, DHL, and FedEx into a unified MySQL analytics warehouse. It uses Python to pull data from REST APIs, normalizes heterogeneous responses into a star schema (fact + dimensions), orchestrates incremental updates via Apache Airflow DAGs, and builds SLA monitoring tables for on-time delivery analytics. The platform handles thousands of events per carrier per day, runs incremental loads, and was delivered in about 6 months, ahead of schedule. This solution gives operations teams clear visibility into shipment performance, delays, and carrier reliability.
The system follows a classic API-driven ETL architecture with orchestration for reliability:
Data Sources: REST APIs from ShipStation, DHL, and FedEx.
Ingestion Layer: Python scripts (requests-based) pulling shipment, tracking, and delivery events.
Processing Layer: Python + pandas normalize and standardize fields into a common model.
Storage Layer: MySQL database with fact_shipment and dim_carrier plus SLA tables.
Orchestration Layer: Airflow DAGs schedule and manage incremental ETL runs per carrier. Tasks are logically split per carrier and step (extract → transform → load → SLA refresh).
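The per-carrier DAG layout described above can be sketched roughly as follows. The `dag_id`, schedule, and task callables are illustrative assumptions, not the project's actual module layout; the real callables would contain the extract/transform/load logic rather than placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_shipments(**context):
    """Pull new/updated shipments from the carrier API (placeholder)."""


def transform_shipments(**context):
    """Normalize raw JSON into the common model (placeholder)."""


def load_shipments(**context):
    """UPSERT facts and dimensions into MySQL (placeholder)."""


def refresh_sla(**context):
    """Recompute the SLA aggregate tables (placeholder)."""


# One DAG per carrier (e.g. dhl_etl, fedex_etl) keeps failures isolated.
with DAG(
    dag_id="shipstation_etl",
    start_date=datetime(2025, 6, 1),
    schedule_interval="@hourly",   # assumed cadence; tune per carrier rate limits
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_shipments)
    transform = PythonOperator(task_id="transform", python_callable=transform_shipments)
    load = PythonOperator(task_id="load", python_callable=load_shipments)
    sla = PythonOperator(task_id="sla_refresh", python_callable=refresh_sla)

    # extract -> transform -> load -> SLA refresh, per the pipeline above
    extract >> transform >> load >> sla
```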
Star Schema in MySQL:
dim_carrier: carrier_id, carrier_name, api_endpoint, etc.
fact_shipment: shipment_id, carrier_id, status, expected_delivery, actual_delivery, and other shipment attributes.
SLA Monitoring Tables: Derived tables (e.g., sla_monitoring) aggregate metrics such as total shipments per carrier, on-time vs late deliveries, and on-time percentage. Indexes on IDs and dates support efficient querying.
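A minimal DDL sketch of the star schema follows. Columns beyond those named above, the index names, and the exact MySQL types are assumptions; for a self-contained illustration the statements are executed against an in-memory SQLite database, which accepts the same DDL shape.

```python
import sqlite3

DDL = """
CREATE TABLE dim_carrier (
    carrier_id    INTEGER PRIMARY KEY,
    carrier_name  VARCHAR(64) NOT NULL,
    api_endpoint  VARCHAR(255)
);
CREATE TABLE fact_shipment (
    shipment_id        VARCHAR(64) PRIMARY KEY,
    carrier_id         INTEGER NOT NULL REFERENCES dim_carrier(carrier_id),
    status             VARCHAR(32),
    expected_delivery  DATETIME,
    actual_delivery    DATETIME,
    last_updated       DATETIME
);
-- Indexes on IDs and dates support the SLA queries.
CREATE INDEX ix_fact_carrier  ON fact_shipment (carrier_id);
CREATE INDEX ix_fact_expected ON fact_shipment (expected_delivery);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
print(sorted(tables))  # ['dim_carrier', 'fact_shipment']
```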
Extract: Python jobs call APIs using secure credentials. Incremental pulls are based on timestamps or last_updated fields.
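The incremental-window logic can be sketched as below. The parameter name, page size, and overlap window are assumptions (each carrier API names its modified-since filter differently, e.g. ShipStation uses `modifyDateStart`); the actual HTTP paging loop with `requests` is omitted here.

```python
from datetime import datetime, timedelta, timezone


def incremental_params(last_run: datetime, page_size: int = 500) -> dict:
    """Build query params for an incremental pull: only records modified
    since the stored watermark, with a small overlap to avoid missed edits."""
    window_start = last_run - timedelta(minutes=5)  # overlap guards clock skew
    return {
        # Illustrative name; each carrier API names this filter differently.
        "modified_since": window_start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "page_size": page_size,
    }


# In the real job these params feed a requests.get(...) paging loop, and the
# watermark is advanced only after the batch loads successfully.
params = incremental_params(datetime(2025, 6, 2, 12, 0, tzinfo=timezone.utc))
print(params["modified_since"])  # 2025-06-02T11:55:00Z
```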
Transform: pandas normalizes nested JSON. A mapping layer standardizes fields across carriers (e.g., standardizing status names and delivery timestamps).
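A sketch of that normalization step, assuming a simplified payload shape and a much smaller status map than production would carry:

```python
import pandas as pd

# Carrier-specific status values mapped to one canonical vocabulary
# (the real mapping layer covers many more raw statuses per carrier).
STATUS_MAP = {
    "IT": "in_transit", "in_transit": "in_transit",
    "DL": "delivered", "delivered": "delivered",
}

raw = [  # shape loosely modeled on carrier tracking payloads (illustrative)
    {"shipment_id": "S1", "status": "IT",
     "dates": {"expected": "2025-06-03", "actual": None}},
    {"shipment_id": "S2", "status": "delivered",
     "dates": {"expected": "2025-06-01", "actual": "2025-06-01"}},
]

df = pd.json_normalize(raw, sep="_")         # dates.expected -> dates_expected
df["status"] = df["status"].map(STATUS_MAP)  # standardize status names
df["dates_expected"] = pd.to_datetime(df["dates_expected"])
print(df["status"].tolist())  # ['in_transit', 'delivered']
```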
Load: Batch loads to MySQL via SQLAlchemy using UPSERT patterns for incremental updates. Facts and dimensions are handled in separate flows.
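The MySQL UPSERT pattern can be sketched with SQLAlchemy's MySQL dialect as below; the table definition is abbreviated, and the statement is compiled to SQL text for illustration rather than executed against a live database (in production it would run via `engine.connect()`/`conn.execute(stmt)`).

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, MetaData, String, Table
from sqlalchemy.dialects import mysql
from sqlalchemy.dialects.mysql import insert

metadata = MetaData()
fact_shipment = Table(
    "fact_shipment", metadata,
    Column("shipment_id", String(64), primary_key=True),
    Column("carrier_id", Integer),
    Column("status", String(32)),
    Column("actual_delivery", DateTime),
)

# MySQL UPSERT: insert new shipments, update changed ones in place.
stmt = insert(fact_shipment).values(
    shipment_id="S1", carrier_id=1, status="delivered",
    actual_delivery=datetime(2025, 6, 2),
)
stmt = stmt.on_duplicate_key_update(
    status=stmt.inserted.status,
    actual_delivery=stmt.inserted.actual_delivery,
)

sql = str(stmt.compile(dialect=mysql.dialect()))
print(sql)  # INSERT ... ON DUPLICATE KEY UPDATE status = VALUES(status), ...
```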
SLA Monitoring: SQL queries compute performance metrics and populate SLA monitoring tables, refreshed via Airflow DAGs.
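The shape of that SLA aggregation can be sketched as follows; table and column names match the schema above, but the sample data is invented, and the query runs against an in-memory SQLite database here for portability (the same aggregation populates `sla_monitoring` in MySQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_shipment (
    shipment_id TEXT PRIMARY KEY, carrier_id INTEGER,
    expected_delivery TEXT, actual_delivery TEXT);
INSERT INTO fact_shipment VALUES
    ('S1', 1, '2025-06-01', '2025-06-01'),   -- on time
    ('S2', 1, '2025-06-01', '2025-06-03'),   -- late
    ('S3', 1, '2025-06-02', '2025-06-02');   -- on time
""")

SLA_SQL = """
SELECT carrier_id,
       COUNT(*)                                  AS total_shipments,
       SUM(actual_delivery <= expected_delivery) AS on_time,
       SUM(actual_delivery >  expected_delivery) AS late,
       ROUND(100.0 * SUM(actual_delivery <= expected_delivery) / COUNT(*), 1)
                                                 AS on_time_pct
FROM fact_shipment
WHERE actual_delivery IS NOT NULL
GROUP BY carrier_id
"""
row = conn.execute(SLA_SQL).fetchone()
print(row)  # (1, 3, 2, 1, 66.7)
```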
Project Start: June 1, 2025 | Duration: ~6 months (Delivered ahead of schedule)
Testing: Unit Tests with Pytest for API wrappers; Integration Tests with mock APIs and end-to-end Airflow runs; Performance Tests handling ~10k events in under 10 minutes.
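A minimal example of the unit-test style for API wrappers, with the HTTP session mocked so the test never touches a live carrier API. The wrapper itself is a simplified stand-in (the real wrappers also handle paging and retries), and the URL is a placeholder.

```python
from unittest.mock import Mock


def fetch_shipments(session, base_url, params):
    """Thin API wrapper: GET /shipments and return the parsed list.
    Illustrative only; real wrappers add paging, retries, and auth."""
    resp = session.get(f"{base_url}/shipments", params=params)
    resp.raise_for_status()  # surface HTTP errors to the caller
    return resp.json()["shipments"]


def test_fetch_shipments_parses_payload():
    # Mocked session: no network, deterministic payload.
    session = Mock()
    session.get.return_value.json.return_value = {
        "shipments": [{"shipment_id": "S1", "status": "delivered"}]
    }
    shipments = fetch_shipments(session, "https://api.example.test", {"page": 1})
    assert shipments[0]["shipment_id"] == "S1"
    session.get.assert_called_once()
```

Run with `pytest`, which collects any `test_*` function automatically.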
Deployment: Airflow via Docker/Kubernetes; MySQL cloud-hosted; all configurations managed via Git and environment variables.
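Configuration via environment variables can be sketched as below; the variable names and connection-string format (PyMySQL driver) are assumptions, with the values injected per deployment through Docker/Kubernetes secrets.

```python
import os


def mysql_url(env=os.environ):
    """Assemble the warehouse connection string from environment variables
    (illustrative names; defaults shown are for local development only)."""
    user = env.get("MYSQL_USER", "etl")
    password = env.get("MYSQL_PASSWORD", "")
    host = env.get("MYSQL_HOST", "localhost")
    db = env.get("MYSQL_DB", "logistics")
    return f"mysql+pymysql://{user}:{password}@{host}/{db}"


print(mysql_url({"MYSQL_USER": "etl", "MYSQL_PASSWORD": "s3cret",
                 "MYSQL_HOST": "db.internal", "MYSQL_DB": "logistics"}))
# mysql+pymysql://etl:s3cret@db.internal/logistics
```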
Monitoring: Airflow UI for DAG health; MySQL logs for query performance. Success target: ~99% reconciliation accuracy between carrier API data and the MySQL warehouse.
Maintenance: Daily incremental refresh, monthly backups, and rotation of API keys. Estimated Cost: ~$200/month for infrastructure.
Methodology: Agile with 2-week sprints, starting with a POC for one carrier.